Construct a multi-lingual speech corpus in taiwan with extracting phonetically balanced articles

نویسندگان

  • Min-Siong Liang
  • Dau-Cheng Lyu
  • Yuang-Chin Chiang
  • Ren-Yuan Lyu
چکیده

In this paper, we describe an initial stage to construct a multilingual speech corpus in Taiwan with selecting phonetically balanced scripts. It is expected to collect a multilingual speech corpus covering three most frequently used languages in Taiwan, including Taiwanese (Min-nan), Hakka, and Mandarin Chinese. To achieve the objective, constructing a multilingual phonetic alphabet, namely Formosa Phonetic Alphabet (ForPA), is the first step. In addition, the multilingual lexicons (Fomosa Lexicons) are also important parts for building the corpus. Recently, this corpus containing 2,300 speakers’ speech database has been finished and is ready to be released. It contains about 200 hours of speech in 154,000 utterances.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

VOXMEX Speech Database: Design of a Phonetically Balanced Corpus

We present a method for designing a phonetically balanced speech corpus. In this method, we used a phonotactic approach to design the phonetic content of VOXMEX: a phonetically balanced corpus for Mexican Spanish. The transcriptions of VOXMEX contain a complete coverage of phonemes and allophones of Mexican Spanish in every possible context. This corpus is designed for doing phonetic research a...

متن کامل

Experiments on the Construction of a Phonetically Balanced Corpus from the Web

The construction of a speech recognition system requires a recorded set of phrases to compute the pertinent acoustic models. This set of phrases must be phonetically rich and balanced in order to obtain a robust recognizer. By tradition, this set is defined manually implicating a great human effort. In this paper we propose an automated method for assembling a phonetically balanced corpus (set ...

متن کامل

The design of the newspaper-based Japanese large vocabulary continuous speech recognition corpus

In this paper we present the first public Japanese speech corpus for large vocabulary continuous speech recognition (LVCSR) technology, which we have titled JNAS (Japanese Newspaper Article Sentences). We designed it to be comparable to the corpora used in the American and European LVCSR projects. The corpus contains speech recordings (60 hrs.) and their orthographic transcriptions for 306 spea...

متن کامل

AMISCO: The Austrian German Multi-Sensor Corpus

We introduce a unique, comprehensive Austrian German multi-sensor corpus with moving and non-moving speakers to facilitate the evaluation of estimators and detectors that jointly detect a speaker’s spatial and temporal parameters. The corpus is suitable for various machine learning and signal processing tasks, linguistic studies, and studies related to a speaker’s fundamental frequency (due to ...

متن کامل

Designing the Latvian Speech Recognition Corpus

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus designing process through analysis of related work on speech corpora creation for different languages. The authors provide also guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004